 gradient norm


A PAC-Bayesian View of Generalisation for Physics-Informed Machine Learning

arXiv.org Machine Learning

Physics-informed machine learning (PIML) integrates mechanistic knowledge, typically in the form of partial differential equations (PDE), into data-driven models. Despite strong empirical performance, its statistical generalisation properties remain poorly understood, particularly in the regression setting with unbounded losses. Existing analyses rely on approximation or stability arguments and do not fully capture how physical structure influences generalisation from finite data. In this work, we develop a PAC-Bayesian framework for PIML that provides high-probability generalisation guarantees in the presence of unbounded losses. We adopt a multi-task perspective that jointly treats data fidelity, PDE residuals, initial and boundary conditions, avoiding the looseness induced by standard union-bound approaches. Our analysis leverages the structure of physics-informed objectives to derive novel bounds where the complexity scales with input-gradient norms of the losses, revealing a direct link between physical regularity and generalisation. We instantiate this framework under Sobolev and Poincaré-type assumptions, yielding two classes of bounds that trade off statistical complexity and smoothness in different regimes. Building on these results, we propose a self-bounding-aware learning algorithm that directly optimises tractable surrogates of the derived bounds, along with a practical procedure to estimate the associated constants in realistic settings. Empirical evaluations on standard PDE benchmarks demonstrate that our bounds are non-vacuous, significantly tighter than union-bound baselines, and can be effectively minimised during training. Overall, our results provide a principled statistical foundation for the generalisation of physics-informed models.


ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models

arXiv.org Machine Learning

Schedule-Free Learning has shown promise as a practical anytime training method for machine learning, showing success across dozens of standard benchmark problems. However, strong performance for LLM training has only been demonstrated at small scales. We identify a number of fixes necessary to scale up Schedule-Free Learning to larger batch sizes and model sizes, and present a learning-rate-free and schedule-free method (ScheduleFree+) for training large language models which greatly outperforms Warmup-Stable-Decay (WSD) schedules. We also demonstrate that Schedule-Free Learning is most effective for long duration training, and at 1000 tokens per parameter, it outperforms SOTA schedules by 31%. Schedule-Free Learning provides a theoretical foundation for the use of model averaging and checkpoint merging during pretraining.


Robust and Fast Training via Per-Sample Clipping

arXiv.org Machine Learning

We propose a robust gradient estimator based on per-sample gradient clipping and analyze its properties both theoretically and empirically. We show that the resulting method, per-sample clipped SGD (PS-Clip-SGD), achieves optimal in-expectation convergence rates for non-convex optimization problems under heavy-tailed gradient noise. Moreover, we establish high-probability convergence guarantees that match the in-expectation rates up to polylogarithmic factors in the failure probability. We complement our theoretical results with multiple numerical experiments. In particular, we demonstrate that PS-Clip-SGD outperforms both vanilla SGD with momentum and standard gradient clipping when training AlexNet on the CIFAR-100 dataset, even after accounting for the additional computational time caused by per-sample clipping. We also empirically show that, in the presence of gradient accumulation, applying clipping at the mini-batch level can improve training performance while incurring virtually no additional computational cost. This finding is particularly interesting, as it contradicts the common practice of applying clipping only after all accumulation steps have been completed.
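
The difference between clipping each sample's gradient and clipping the averaged mini-batch gradient can be illustrated with a minimal sketch. This is not the paper's PS-Clip-SGD implementation: the function names, the toy gradients, and the threshold `max_norm=1.0` are all assumptions chosen for illustration, and only the standard clip-by-Euclidean-norm rule is used.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

def per_sample_clipped_mean(per_sample_grads, max_norm):
    """Clip every per-sample gradient first, then average the clipped gradients."""
    clipped = [clip_by_norm(g, max_norm) for g in per_sample_grads]
    n, dim = len(clipped), len(clipped[0])
    return [sum(g[i] for g in clipped) / n for i in range(dim)]

def batch_clipped_mean(per_sample_grads, max_norm):
    """Average first, then clip the mini-batch gradient (standard clipping)."""
    n, dim = len(per_sample_grads), len(per_sample_grads[0])
    mean = [sum(g[i] for g in per_sample_grads) / n for i in range(dim)]
    return clip_by_norm(mean, max_norm)

# Hypothetical toy batch: three well-behaved samples and one heavy-tailed outlier.
grads = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [-100.0, 0.0]]

ps = per_sample_clipped_mean(grads, max_norm=1.0)  # outlier is bounded before averaging
bc = batch_clipped_mean(grads, max_norm=1.0)       # outlier dominates the mean before clipping

print("per-sample clipping:", ps)
print("mini-batch clipping:", bc)
```

In this toy batch the per-sample estimate keeps the direction shared by the three regular samples, whereas the mini-batch estimate points in the direction of the single outlier. The cost is that per-sample gradients must be materialised, which is the computational overhead the abstract refers to.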


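A Missing Proofs. Theorem 1. The excessive loss of a group $a \in A$ is upper bounded by³: $R(a) \leq \|g_a^\ell\| \, \|\bar{\theta} - \theta^*\| + \frac{1}{2} \lambda(H_a^\ell) \, \|\bar{\theta} - \theta^*\|^2 + O(\|\bar{\theta} - \theta^*\|^3)$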

Neural Information Processing Systems

Here $g_a^\ell = \nabla_\theta J(\theta^*; D_a)$ is the gradient and $H_a^\ell = \nabla_\theta^2 J(\theta^*; D_a)$ is the Hessian matrix of the loss function $\ell$ at the optimal parameter vector $\theta^*$, computed using the group data $D_a$ (henceforth simply referred to as the group Hessian), and $\lambda(\Sigma)$ is the maximum eigenvalue of a matrix $\Sigma$. Proof. Using a second-order Taylor expansion around $\theta^*$, the excessive loss $R(a)$ for a group $a \in A$ can be stated as: $R(a) = J(\bar{\theta}; D_a) - J(\theta^*; D_a) = \left[ J(\theta^*; D_a) + (\bar{\theta} - \theta^*)^\top g_a^\ell + \frac{1}{2} (\bar{\theta} - \theta^*)^\top H_a^\ell (\bar{\theta} - \theta^*) + O(\|\bar{\theta} - \theta^*\|^3) \right] - J(\theta^*; D_a)$. The above follows from the loss $\ell(\cdot)$ being at least twice differentiable, by assumption. Consider two groups $a$ and $b$ in $A$ with $|D_a| \geq |D_b|$. Proposition 2. For a given group $a \in A$, gradient norms can be upper bounded as: $\|g_a^\ell\| \in O(\ldots)$. The above proposition is presented in the context of cross-entropy loss or mean squared error loss functions. These two cases are reviewed as follows. ³ With a slight abuse of notation, the results refer to $\bar{\theta}$ as the homonymous vector which is extended with $k - \bar{k}$ zeros.
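
The bound in Theorem 1 (gradient norm times parameter distance, plus half the largest Hessian eigenvalue times squared distance) can be checked numerically on a small example. The sketch below assumes a hypothetical two-parameter quadratic group loss, for which the second-order Taylor expansion is exact and the cubic remainder vanishes. The matrix `H`, the vector `b`, and both parameter vectors are made-up values, not taken from the paper.

```python
import math

# Hypothetical group loss J_a(theta) = 0.5 * theta^T H theta - b^T theta.
H = [[2.0, 0.5], [0.5, 1.0]]  # symmetric group Hessian (assumed values)
b = [1.0, 1.0]

def matvec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def norm(v):
    return math.sqrt(dot(v, v))

def loss(theta):
    return 0.5 * dot(theta, matvec(H, theta)) - dot(b, theta)

def grad(theta):
    Ht = matvec(H, theta)
    return [Ht[0] - b[0], Ht[1] - b[1]]

def max_eig_sym_2x2(M):
    """Largest eigenvalue of a symmetric 2x2 matrix, in closed form."""
    mean = 0.5 * (M[0][0] + M[1][1])
    half_diff = 0.5 * (M[0][0] - M[1][1])
    return mean + math.sqrt(half_diff ** 2 + M[0][1] ** 2)

theta_star = [0.8, 0.1]  # stands in for the optimal parameters of the full model
theta_bar = [0.8, 0.0]   # pruned copy: the small-magnitude weight is set to zero

diff = [theta_bar[0] - theta_star[0], theta_bar[1] - theta_star[1]]

# Excessive loss of the group and the two terms of the upper bound.
excess = loss(theta_bar) - loss(theta_star)
bound = (norm(grad(theta_star)) * norm(diff)
         + 0.5 * max_eig_sym_2x2(H) * norm(diff) ** 2)

print("excessive loss:", excess)
print("upper bound:   ", bound)
```

The first term follows from the Cauchy-Schwarz inequality applied to the gradient term of the expansion, and the second from bounding the quadratic form by the largest Hessian eigenvalue, so the inequality holds for any choice of the assumed values when the Hessian is symmetric.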